YOLOv8-CML: a lightweight target detection method for color-changing melon ripening in intelligent agriculture

Color-changing melon is an ornamental and edible fruit. To address the slow detection speed and high deployment cost of Color-changing melon detection on intelligent agriculture equipment, this study proposes a lightweight detection model, YOLOv8-CML. First, a lightweight Faster-Block is introduced to reduce the number of memory accesses and the redundant computation, yielding a lighter C2f structure. Then, a lightweight C2f module fused with the EMA module is constructed in the Backbone to collect multi-scale spatial information more efficiently and reduce the interference of complex backgrounds on recognition. Next, the idea of shared parameters is used to redesign the detection head and simplify the model further. Finally, the α-IoU loss function is adopted to better measure the overlap between the predicted and real frames using the α hyperparameter, improving recognition accuracy. The experimental results show that, compared to the YOLOv8n model, the parameters and FLOPs of the improved YOLOv8-CML model decreased by 42.9% and 51.8%, respectively. In addition, the model size is only 3.7 MB, and the inference speed is improved by 6.9%, while mAP@0.5, accuracy, and FPS are also improved. Our proposed model provides a vital reference for deploying Color-changing melon picking robots.

(2) We proposed a lightweight detection algorithm based on the YOLOv8n model, called YOLOv8-CML. It improves detection speed while keeping the model lightweight and is more suitable for running on resource-constrained devices, which can provide some ideas for subsequent deployment.
(3) We designed several comparative experiments for detecting Color-changing melons in the dataset to illustrate the performance differences between algorithms and demonstrate the recognition effect of the model. The experimental results show that YOLOv8-CML exhibits excellent overall performance compared to other lightweight algorithms.
Therefore, our algorithm can provide suitable technical support for the robotic picking of Color-changing melons in intelligent agriculture.

Data acquisition and labeling
Color-changing melon images were captured from Vegetable High-Tech Demonstration Park, No. 108, Luocheng Street, Shouguang City, Weifang City, Shandong Province, China. The location is marked by a latitude of 36.855968 and a longitude of 118.818556. All images were collected in July 2023 at a distance of 80-120 cm using a vivo X80 phone equipped with a Sony IMX 866 RGBW outsole sensor. Owing to the constrained variety within the training samples and to reduce the risk of model overfitting, we captured images from three angles: directly in front, directly behind, and at the bottom of the Color-changing melon. These images include light and dark variations, branch and leaf shadows, and fruit overlap, which are common in the picking process. Figure 1 shows samples from our Color-changing melon dataset obtained in different scenarios. In total, we collected 1297 original Color-changing melon images, which were saved in JPG format with 4032 × 3024 pixels. We then screened the captured raw images to exclude low-quality images such as over-blurred, overexposed, heavily occluded, and incomplete cases. Finally, we generated a dataset containing 1240 Color-changing melon images.
Color-changing melon (scientific name Trichosanthes cucumerina) is a monoecious, heterogamous plant with a fruiting period of up to four months. A plant can simultaneously yield fruits of varying degrees of maturity, comprising mature, semi-mature, and immature fruits, depending on when the female flowers are pollinated. The exterior color of Color-changing melons gradually changes from green to yellow and eventually to red as maturity increases. The key indicator of ripeness in Color-changing melons is the color of the rind surface. Color-changing melons at the immature stage usually have light or dark green rinds, and some fruits may have finely striped rinds, which make them suitable for eating. As the ripening process progresses, the melon rind color gradually changes from green to orange, a change that extends from the base of the melon to the vines, marking the beginning of the semi-mature stage. By the time it reaches the mature stage, Color-changing melons have a deep red color on their surface, and usually, when these fruits are picked, the seeds are removed for next year's planting. In practical production applications, we need to pick different types of fruits in the same season due to different uses. Using algorithms to acquire information about fruits helps to enable precise location and classification, which in turn offers a reference for the work of automated harvesting. In this study, green rinds were considered to represent immature Color-changing melons, while rinds with some or all of their surface color orange were considered to be semi-mature Color-changing melons, and dark red rinds represented mature Color-changing melons. In order not to affect the extraction of Color-changing melon features and to reduce the computational load, the dataset containing 1240 Color-changing melon images was first compressed to a resolution of 640 × 640 pixels and then randomly split into training set, validation set, and test set on the scale of 8:1:1 so that there are
992 images in the training set, 124 images in the validation set, and 124 images in the

Data augmentor
After completing the Color-changing melon image annotation, we apply the Augmentor tool to extend the training set to augment the fruit features, avoid overfitting, and improve the generalization of the model. The data augmentation approaches comprise adjusting horizontal rotation, left and right mirroring, vertical mirroring, brightness, contrast, random distortion, Gaussian noise, and Gaussian blur. Table 1 provides a description of all enhancement operations, while some enhanced images are shown in Fig. 2. After data augmentation, the images in the training set are expanded from the original 992 to 2142. The number of labels for each set of three categories is shown in Table 2.

YOLOv8-CML model
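With the wide adoption of computer vision in several fields, target detection algorithms centered on Convolutional Neural Networks (CNNs) are playing an increasingly important role in modern agriculture. Among these algorithms, YOLOv8 is a deep learning-based real-time target detection and image segmentation model released in January 2023 by Ultralytics. For now, YOLOv8 is the newest version of the YOLO family, which improves its comprehensive performance compared to previous versions while providing a unified framework for model training. In addition, it is suitable for an extensive range of tasks, including target detection, target tracking, instance segmentation, and image classification. The model structure of YOLOv8 is divided into three main parts: (1) Backbone for extracting target features; (2) Neck for fusing multi-scale feature information; (3) Detect head for predicting confidence, category, and anchor frames. YOLOv8 is inspired by VGG, and the Backbone references the network structure of CSPDarkNet-53, with the count of channels doubled after the pooling operation. To improve accuracy, Mosaic data augmentation is disabled for the last ten epochs before the end of training.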
The Neck adopts the CSP idea, which fuses the original feature part and the feature part processed by multiple Conv operations to improve feature extraction. YOLOv8 refers to the design idea of ELAN in YOLOv7, and both Backbone and Neck adopt the C2f structure, which makes the feature information richer. The head uses the decoupled head structure that is now mainstream, separating the classification branch from the regression branch. The classification branch continues to use the BCE (binary cross entropy) loss, the regression branch contains the distribution focal loss (DFL) 42, and the head adopts the Anchor Free idea. Like YOLOv5, YOLOv8 provides models in n, s, m, l, and x scales based on scaling factors to suit different scenarios. YOLOv8n is the smallest model in the YOLOv8 family, which features fewer parameters and lower hardware requirements. In this study, YOLOv8n is chosen as the foundation model to lower computing and memory expenses and to lighten the model, thus laying the foundation for deployment on embedded or mobile devices in real-time target detection situations 42. Therefore, we present the YOLOv8-CML model, which is based on the YOLOv8n model with the goal of increasing the detection speed while keeping parameters and FLOPs small. The structure of the improved model is shown in Fig. 3.

C2f-PConv module
To offset the increase in parameters and FLOPs associated with the use of the CSP module 43 in YOLOv8, we follow the approach of the FasterNet 44 model and use the C2f-PConv module in place of the multiple convolutional operations of the CSP module. We introduce the PConv architecture into the C2f module. Unlike regular convolution, PConv is a partial convolution that applies a regular Conv for spatial feature extraction to a specific subset of the input channels while keeping the remaining channels unaltered. When consecutive or regular memory accesses are required, PConv usually ignores the information of the other channels and uses only the first or last consecutive channels to represent the whole feature map, which reduces feature redundancy. In order to utilize the information of each channel more efficiently, a pointwise convolution (PWConv) is appended to PConv, the principle of which is shown in Fig. 4a. Together they form a T-shaped Conv pattern in the effective receptive field of the input feature map, as shown in Fig. 4b. Compared with the regular Conv shown in Fig. 4c, the T-shaped Conv pattern places more emphasis on the center position. The left side of Fig. 5 shows the operation of PConv: it applies a regular Conv to part of the input channels for spatial feature extraction while leaving the rest of the channels untouched.
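The partial convolution described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `conv2d_same` and `pconv`, the choice of the first channels as the convolved subset, and the toy tensor sizes are assumptions made for illustration only.

```python
import numpy as np

def conv2d_same(x, weight):
    """Naive stride-1, zero-padded convolution (cross-correlation).

    x: array of shape (c_in, h, w); weight: (c_out, c_in, k, k) with odd k.
    """
    c_out, c_in, k, _ = weight.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, h, w = x.shape
    out = np.zeros((c_out, h, w), dtype=x.dtype)
    for i in range(k):
        for j in range(k):
            patch = xp[:, i:i + h, j:j + w]  # shifted view, shape (c_in, h, w)
            out += np.einsum("oc,chw->ohw", weight[:, :, i, j], patch)
    return out

def pconv(x, weight):
    """Partial convolution: convolve the first c_p channels, pass the rest through."""
    c_p = weight.shape[0]  # number of channels that are actually convolved
    conv_part = conv2d_same(x[:c_p], weight)
    return np.concatenate([conv_part, x[c_p:]], axis=0)

# Hypothetical toy input: 8 channels, of which c_p = 8 / 4 = 2 are convolved.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 5, 5))
w = rng.standard_normal((2, 2, 3, 3))
y = pconv(x, w)
print(y.shape)
```

The output keeps the input shape, and the channels beyond the first `c_p` are returned unchanged, which is what allows a following pointwise convolution to mix information across all channels.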
To accommodate different cases, we assume that the input and output feature maps have an equal number of channels. Given a feature map $F \in \mathbb{R}^{h \times w \times c}$, the FLOPs of a $k \times k$ regular Conv are computed as follows:

$$h \times w \times k^2 \times c^2 \tag{1}$$

Assuming that PConv processes only $c_p$ channels and the remaining $c - c_p$ channels are passed through by identity mapping, the FLOPs of PConv are computed as follows:

$$h \times w \times k^2 \times c_p^2 \tag{2}$$

Taking the empirical hyperparameter $r = c_p / c = 1/4$, it follows from Eqs. (1) and (2) that the computational cost of PConv, measured in FLOPs, is merely 1/16 of that of a regular Conv. This is because the Conv operation applied to this part of the channels is the same as the regular Conv operation; PConv thus reduces FLOPs by processing only a small portion of the channels while retaining the high FLOP characteristics of the regular Conv.
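The 1/16 figure can be checked with a short calculation. The sketch below assumes the usual FLOPs counts of h × w × k² × c² for a regular Conv and h × w × k² × c_p² for PConv; the feature-map size and channel count are hypothetical values chosen only for the arithmetic.

```python
def conv_flops(h, w, k, c):
    """FLOPs of a k x k regular Conv with c input and c output channels."""
    return h * w * k * k * c * c

def pconv_flops(h, w, k, c, r=0.25):
    """FLOPs of PConv when only c_p = r * c channels are convolved."""
    c_p = int(c * r)
    return h * w * k * k * c_p * c_p

# Hypothetical layer: 80 x 80 feature map, 3 x 3 kernel, 64 channels.
h, w, k, c = 80, 80, 3, 64
ratio = pconv_flops(h, w, k, c) / conv_flops(h, w, k, c)
print(ratio)  # 0.0625, i.e. 1/16
```

Because the spatial size and kernel size cancel out, the ratio depends only on r and equals r² for any layer shape.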
As shown on the right side of Fig. 5, each C2f-PConv module contains a PConv layer and two Conv layers with kernel size 1 × 1, which together form an inverted residual block in which the second layer contains a larger number of channels. To facilitate the reuse of input features, a shortcut is created in front of the first layer. Previous work overused Batch Normalization and activation layers, which can limit the diversity of features and reduce the overall speed of computation 45,46. Therefore, the C2f-PConv module applies Batch Normalization and the ReLU activation function only after the second layer to maintain good performance.

Efficient multi-scale attention
Attention mechanisms play an important role in computer vision. They facilitate the model's learning process by allocating distinct weights to individual sections of the input information. This helps the model to recognize the more significant information and improves its performance while mitigating overfitting 47. Immature Color-changing melons exhibit color similarities with canes and leaves. Thus, the differences in shape and texture between the fruit and other background objects become more important. Considering this, using an attention module can effectively add critical information to improve detection accuracy. The Efficient Multi-Scale Attention module is an enhanced feature aggregation method 48, which folds part of the channels into the batch dimension and partitions the channel dimension into several sub-feature groups, efficiently preserving the channel-specific information. Thus, the spatial semantic features are well distributed within every feature group. The overall structure of Efficient Multi-Scale Attention is shown in Fig. 6a. The Efficient Multi-Scale Attention mechanism adopts three parallel paths to extract the attention weight descriptors of the grouped feature maps. Two paths use 1 × 1 Conv layers, while the third uses one 3 × 3 Conv layer. In the 1 × 1 Conv branch, the channel information is encoded into the global space by a 2D global average pooling operation, and the output of the smallest branch is directly reshaped to the corresponding dimensions before the channel features are combined. The function of the 1 × 1 Conv is to encode the information coming from both spatial directions, enabling the model to interact across channels. In the 3 × 3 branch, the global spatial information is also encoded using a 2D global average pooling operation. The 2D global pooling operation is used to encode the global information extracted from the feature map while modelling long-range dependencies.
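The equation of the 2D global pooling operation is expressed as follows:

$$z_c = \frac{1}{H \times W} \sum_{j}^{H} \sum_{i}^{W} x_c(i, j) \tag{3}$$

where $z_c$ is the output associated with the c-th channel and $x_c(i, j)$ is the input feature of the c-th channel at position $(i, j)$ of an $H \times W$ feature map.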
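As a small illustration of the 2D global average pooling operation, the following NumPy sketch averages each channel of a feature map over its spatial positions. The function name `global_avg_pool_2d` and the toy input are assumptions for illustration.

```python
import numpy as np

def global_avg_pool_2d(x):
    """Average each channel over its spatial positions.

    x: array of shape (C, H, W); returns an array of shape (C,).
    """
    return x.mean(axis=(1, 2))

# Hypothetical feature map with 2 channels of size 2 x 3.
x = np.arange(12, dtype=np.float64).reshape(2, 2, 3)
z = global_avg_pool_2d(x)
print(z)  # [2.5 8.5]
```

Each output value summarizes one whole channel, which is how the pooling step encodes global spatial information into a per-channel descriptor.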
In this study, we add the Efficient Multi-Scale Attention module to C2f-PConv and name it Faster-EMA to replace the C2f module in the Bottleneck section. The improved structure is shown in Fig. 6b. The Efficient Multi-Scale Attention module is capable of cross-space learning, i.e., it can integrate information from various spatial dimensions, which facilitates more complete feature integration. Therefore, replacing the original C2f module with Faster-EMA can extend the depth and receptive field of the model and enhance its feature integration ability, so that Color-changing melons with different maturity levels are detected more accurately. Faster-EMA contains three Conv layers and an attention module. Specifically, the first layer is PConv, which processes only some of the features, and its output channel count equals its input channel count. The second layer is a 1 × 1 Conv that doubles the channel count. The third layer is also a 1 × 1 Conv, but it halves the channel count. The fourth layer is the Efficient Multi-Scale Attention module, whose output channel count matches its input channel count, according to Fig. 6a. Finally, the performance of the whole module is improved by using shortcut connections. It is worth noting that only the 1 × 1 Conv of the second layer applies the Batch Normalization and ReLU activation functions.

Efficient detect
Instead of using the coupled head structure of YOLOv5, YOLOv8 opts for decoupled heads. Each decoupled head includes two branches, one for the classification task and one for the regression task, and adopts the Anchor Free idea. Each branch comprises two 3 × 3 Conv layers and one 1 × 1 Conv2D layer, so when there are more feature layers, the stacked 3 × 3 Conv layers increase the parameters and FLOPs of the head. The detection head of YOLOv8n accounts for 3.46 GFLOPs, which is 42.7% of the total computational complexity of the model. This means that the detection part occupies almost half of the total FLOPs of the model, which hinders efforts to make the model lightweight. As shown in Fig. 7, we were inspired by the RetinaNet 49 detection head to adopt the idea of shared parameters while balancing detection accuracy and model complexity. We replace the two 3 × 3 Conv layers of the two branches in each detection head with one PConv layer and one 1 × 1 Conv layer, whose output is subsequently used for the classification and regression tasks, respectively. This partially coupled approach reduces the parameters and FLOPs while maintaining detection accuracy. After the improvement, the detection head has 0.2 M parameters and a computational complexity of 0.46 GFLOPs, which are 22.3% and 12.6% of those of the original YOLOv8n head, respectively. This significantly reduces the parameters and FLOPs of the detection head and makes the model lighter. Finally, from the point of view of model improvement, our proposed method generalizes to arbitrary combinations of convolutions, such as replacing the original structure with a combination of one 3 × 3 Conv layer and one 1 × 1 Conv layer or a combination of other types of convolutions.
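To give a rough sense of why the shared PConv plus 1 × 1 Conv stem is so much lighter than the stacked 3 × 3 Conv layers, the sketch below counts convolution weights for one detection scale. It is a simplified estimate under stated assumptions, not a reproduction of the reported figures: the channel width `c` is hypothetical, every layer is assumed to keep `c` channels, the PConv kernel is assumed to be 3 × 3 with r = 1/4, and biases, normalization layers, and the final 1 × 1 prediction layers are ignored.

```python
def conv_params(c_in, c_out, k):
    """Weight count of a k x k convolution without bias."""
    return c_in * c_out * k * k

def pconv_params(c, k, r=0.25):
    """Weight count of a PConv that convolves only c_p = r * c channels."""
    c_p = int(c * r)
    return conv_params(c_p, c_p, k)

c = 64  # hypothetical channel width of one feature level

# Decoupled head: two branches, each with two stacked 3 x 3 Conv layers.
original = 2 * 2 * conv_params(c, c, 3)

# Shared stem: one PConv layer followed by one 1 x 1 Conv layer.
shared = pconv_params(c, 3) + conv_params(c, c, 1)

print(original, shared, shared / original)
```

Under these assumptions the shared stem needs 6400 weights against 147456 for the four stacked 3 × 3 layers. The real reduction depends on the actual channel widths and on the layers left out of this count.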

Improved loss function
This subsection describes the loss function design of the YOLOv8 model and our approach to optimizing the loss function. Compared with previous target detection methods, the YOLOv8 model no longer detects targets in a residual two-stage manner. Its head performs only classification and regression, so the loss consists of a category loss and a location loss. For the category loss, the YOLOv8 model employs the BCE (binary cross entropy) loss and uses the SiLU 50 activation function. For the location loss, the YOLOv8 model employs the CIoU loss function to compute the IoU between the detected and target frames and also employs DFL 42 to optimize the location loss.
The CIoU loss function is defined as:

$$L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \theta v, \quad v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2, \quad \theta = \frac{v}{(1 - IoU) + v} \tag{4}$$

where $b$ denotes the predicted bounding box and $b^{gt}$ denotes the real bounding box. $\theta$ is a weighting function, $c$ denotes the diagonal length of the minimum bounding box that contains the predicted and real bounding boxes, $\rho$ denotes the Euclidean distance between the centers of the real bounding box and the predicted bounding box, and $v$ measures the similarity of the aspect ratios. The CIoU loss function not only takes into account the scale loss of the detected frame but also improves the width and height loss, allowing the detected frame to better match the shape of the real frame. However, the last term $v$ in the formula is not well defined: since $v$ only reflects the difference in aspect ratio while ignoring other aspects of similarity, the CIoU loss function may optimize the similarity in an unreasonable way, which limits how effectively the model reduces the true difference between $(w, h)$ and $(w^{gt}, h^{gt})$. To solve this problem, this study draws on the approach of α-IoU 51 to optimize the loss function. α-IoU is an IoU loss based on the Box-Cox transform, which generalizes the existing IoU-based losses to a new family of power IoU losses, comprising a power IoU term and an additional power regularization term. α-IoU increases the accuracy of bounding box regression by adaptive reweighting and is suitable for lightweight models.
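The changed formula is shown below:

$$L_{\alpha\text{-}CIoU} = 1 - IoU^{\alpha} + \frac{\rho^{2\alpha}(b, b^{gt})}{c^{2\alpha}} + (\theta v)^{\alpha} \tag{5}$$

where α is a hyperparameter that can be tuned to give the detector more flexibility in achieving different levels of bounding box regression precision. According to the authors' suggestion, target detection performs better when α = 3.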
In this study, we name this optimized loss function as α-CIoU.
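The following sketch computes an α-CIoU style loss for a single pair of axis-aligned boxes. It assumes the standard CIoU terms (center-distance penalty and aspect-ratio term) and raises each term to the power α as in the α-IoU formulation; the function name, the `(x1, y1, x2, y2)` box layout, and the small `eps` stabilizer are assumptions for illustration, not the training code used in this study.

```python
import math

def alpha_ciou_loss(box, box_gt, alpha=3.0, eps=1e-9):
    """alpha-CIoU loss for two boxes given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = box_gt
    w, h = x2 - x1, y2 - y1
    w_gt, h_gt = gx2 - gx1, gy2 - gy1

    # Intersection over union.
    inter_w = max(0.0, min(x2, gx2) - max(x1, gx1))
    inter_h = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = inter_w * inter_h
    union = w * h + w_gt * h_gt - inter + eps
    iou = inter / union

    # Squared diagonal of the smallest box enclosing both boxes.
    cw = max(x2, gx2) - min(x1, gx1)
    ch = max(y2, gy2) - min(y1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Squared distance between the two box centers.
    rho2 = ((x1 + x2) / 2 - (gx1 + gx2) / 2) ** 2 \
         + ((y1 + y2) / 2 - (gy1 + gy2) / 2) ** 2

    # Aspect-ratio term v and its weight theta.
    v = (4 / math.pi ** 2) * (math.atan(w_gt / (h_gt + eps))
                              - math.atan(w / (h + eps))) ** 2
    theta = v / (1 - iou + v + eps)

    return 1 - iou ** alpha + (rho2 / c2) ** alpha + (theta * v) ** alpha

# With alpha = 1 the expression reduces to the plain CIoU loss.
print(alpha_ciou_loss((0, 0, 2, 2), (1, 1, 3, 3), alpha=1.0))
print(alpha_ciou_loss((0, 0, 2, 2), (1, 1, 3, 3), alpha=3.0))
```

Setting `alpha=1.0` recovers the CIoU loss, so the same function can be used to compare the two variants on the same pair of boxes.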

Training set-up
The experimental operating system used in this study is Ubuntu 18.04, while PyTorch serves as the underlying framework for the developed deep learning model. The detailed environment is shown in

Evaluation indicators
To thoroughly and impartially evaluate the improved model's performance, this study employs Precision, Recall, mAP, model size, parameters (Params), FLOPs, and frames per second (FPS). TP denotes true positives (count of Color-changing melons correctly detected as the positive category), FP denotes false positives (count of Color-changing melons falsely detected as the positive category), and FN denotes false negatives (count of Color-changing melons actually belonging to the positive category but incorrectly detected as the negative category).
P denotes the ratio of correctly detected targets in the positive category to the total count of targets predicted by the model as belonging to the positive category, defined as:

$$P = \frac{TP}{TP + FP}$$

R denotes the ratio of correctly predicted targets in the positive category to the total count of real positive targets, defined as:

$$R = \frac{TP}{TP + FN}$$

AP calculates the average precision of a single category under various recall ratios, defined as:

$$AP = \int_0^1 P(R)\,dR$$

The mAP is a comprehensive measure of the ability of the target detection model to perform with multiple categories, which takes the average of all the category APs, defined as:

$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$

where $N$ is the number of categories. Higher mAP values indicate better recognition performance of the algorithm. mAP@0.5 is the average of the APs of all categories at an IoU threshold of 0.5. mAP@0.5:0.95 is the average of the mAP values computed at IoU thresholds from 0.5 to 0.95 in steps of 0.05. The mAP@0.5 and mAP@0.5:0.95 are defined as follows, respectively:

$$mAP@0.5 = \frac{1}{N}\sum_{i=1}^{N} AP_i \big|_{IoU = 0.5}$$

$$mAP@0.5{:}0.95 = \frac{1}{10}\sum_{t \in \{0.5, 0.55, \ldots, 0.95\}} mAP@t$$

In terms of model complexity, we used model size, parameters, and floating point operations 52 as performance metrics for the model. Params are the overall count of trainable parameters in the model, FLOPs represent the quantity of floating point operations, and model size indicates the memory usage. The smaller these three metrics are, the simpler the model structure is, i.e., the more lightweight it is.
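The accuracy metrics above can be sketched in a few lines of Python. Precision and recall follow directly from the TP, FP, and FN counts. For AP, the sketch uses all-point interpolation of the precision-recall curve, which is one common way to approximate the area under the curve and is an assumption here, since the text does not state which numerical scheme was used. All counts and curve points are hypothetical.

```python
def precision(tp, fp):
    """Share of predicted positives that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Share of real positives that are detected."""
    return tp / (tp + fn) if tp + fn else 0.0

def average_precision(recalls, precisions):
    """Area under the precision-recall curve (all-point interpolation).

    recalls must be sorted in ascending order.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision non-increasing when read from left to right.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

def mean_average_precision(aps):
    """Mean of the per-category AP values."""
    return sum(aps) / len(aps)

# Hypothetical counts and curve points, for illustration only.
print(precision(tp=8, fp=2))                        # 0.8
print(recall(tp=8, fn=8))                           # 0.5
print(average_precision([0.5, 1.0], [1.0, 0.5]))    # 0.75
print(mean_average_precision([0.9, 0.8, 0.7]))
```

With three ripeness categories, `mean_average_precision` would be called on the three per-category AP values obtained at a given IoU threshold.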
For detection speed, we use FPS to assess the real detection speed of the algorithm, defined as:

$$FPS = \frac{1}{t_{pre} + t_{inference} + t_{NMS}}$$

where $t_{pre}$, $t_{inference}$, and $t_{NMS}$ are the per-image times consumed by preprocessing, inference, and non-maximum suppression (NMS), respectively. From the above equation, FPS is the reciprocal of the sum of these three times, and a larger FPS value indicates better real-time detection ability of the model.
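As a quick worked example of the FPS definition, the sketch below takes the three per-image times in milliseconds and converts their sum into frames per second. The timing values are hypothetical and are not measurements from this study.

```python
def fps(pre_ms, inference_ms, nms_ms):
    """Frames per second from per-image times given in milliseconds."""
    return 1000.0 / (pre_ms + inference_ms + nms_ms)

# Hypothetical timings: 1 ms preprocessing, 7 ms inference, 2 ms NMS.
print(fps(1.0, 7.0, 2.0))  # 100.0
```

Because all three stages enter the sum, a faster network alone does not raise FPS much if preprocessing or NMS dominates the per-image time.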

Results and analysis
After the model was trained, we tested it on the test set. To assess the effectiveness of the model improvements, we compared the original and final improved models in detecting overlapping fruits, fruits occluded by leaves, and mature fruits with similar colors to the background in Fig. 8.
Compared to the original YOLOv8n, the YOLOv8-CML model uses cross-space multi-scale attention to improve the detection confidence against the leaf background in the case of the many green immature fruits in Figs. 1 and 2. In the case of denser fruit in Figs. 3 and 4, the YOLOv8-CML model improves the confidence of semi-mature and mature fruits that had relatively low confidence. In the case of overlapping fruits in Fig. 5, the YOLOv8-CML model improves the confidence level for the occluded fruits.
Overall, the confidence of the YOLOv8-CML model fluctuates for semi-mature fruits, which is related to the lightweight design of the network structure. As the number of parameters decreases, the parameters related to semi-mature fruits decrease, but this has little effect on the detection accuracy in regular scenarios. The model significantly improves detection in fruit overlap and background interference scenarios.
Figure 9 compares the confusion matrices between the YOLOv8n model and our proposed YOLOv8-CML model. It depicts the accuracy of detecting fruit ripeness for the three categories in the dataset. In the confusion matrix, the rows represent the model's predicted labels for each class of instances, and the columns represent the true labels. Dark colors denote high rates, and light colors denote low rates. This visualization facilitates the observation of distinctions among instances belonging to various categories. As can be seen, our model features comparatively high values on the diagonal and reduces the false detections caused by the similarity of immature fruits to the background features while improving the detection accuracy for semi-mature fruits.
To visually show the effect of the improved model, we plot the heatmaps of the two models in two common scenarios, more dispersed fruit and more concentrated fruit, in Fig. 10. Warmer-colored regions in the heatmap represent regions where the model is confident of the presence of Color-changing melons. On the contrary, regions with colder colors represent regions in which the model is not confident in its prediction. It is worth noting that colder regions do not necessarily contain no Color-changing melons: fruits may still be present in these lower-confidence regions, depending on how well the features of these regions match the features of Color-changing melons that the model has learned. In Figs. 1 and 4, where the fruit is more dispersed, YOLOv8n's hotspots are more scattered, whereas the hotspots of YOLOv8-CML are uniformly distributed over the fruits. In Figs. 2 and 3, the fruits are relatively concentrated, but YOLOv8n's hotspots are concentrated on the central fruits and pay less attention to the fruits at the border, while YOLOv8-CML's hotspots are more evenly distributed. This also indicates that the efficient multi-scale attention aggregates multi-dimensional spatial information and enhances the model's ability to extract features.
To assess the improved algorithm's effect, we designed six sets of ablation experiments, which were used to compare the effectiveness of various improvement strategies on the model performance. We tested using the same devices and dataset, setting the IoU threshold to the default 0.65 in the test and keeping the other hyperparameters at the same default values to ensure comparable results. Table 4 shows the outcomes of the ablation experiments.
First, we use the C2f-PConv module instead of the original C2f module to reduce the model's parameters and FLOPs. The improved model slightly decreases mAP and accuracy, but its complexity is also reduced. Second, we replace the C2f-PConv in the Backbone part with Faster-EMA to improve the algorithm's recognition ability. Although the parameters and FLOPs change only slightly, the model's recognition performance improves notably. Next, we use only Efficient detect to decrease the model's parameters and FLOPs, and we observe a significant decrease in both.
Furthermore, we adjust the accuracy of model detection by modifying only the IoU loss. The α-CIoU improved the model's detection accuracy by about 2% while keeping mAP unchanged. Finally, we combine all the improvement methods to achieve the optimal balance. The results showed that mAP@0.5:0.95 decreased by 0.6%, while mAP@0.5 and Precision improved by 1.4% and 2.4%, respectively. The parameters and FLOPs of the YOLOv8-CML model are significantly decreased compared to the YOLOv8n model, by 1.29 M and 4.2 GFLOPs, respectively.
The ablation experiments show that mAP@0.5 achieves satisfactory results by modifying different algorithm sections. However, the improved algorithm's recognition ability shows a marginal decrease compared to the standard model under stricter evaluation criteria. Overall, the proposed modifications to the original model are effective because they significantly reduce the algorithm's complexity while maintaining its central performance, thus improving its efficiency and portability.
Convolution is one of the key components in Convolutional Neural Networks. Different Conv structures have different functions, but all are essentially used to extract features. To compare the performance of various convolutional methods, we reconfigured them into the form of convolutional blocks and then tested them on an experimental platform. We first run each Conv block 1000 times to let the GPU "warm up," fully utilize its performance, and reduce the error caused by differences in hardware power during the comparison. This part of the data is not counted in the final results. Then, we immediately run the Conv block another 3000 times to record its metrics, including running time, parameters, and FLOPs. The convolutions involved in the comparison include Conv2D, DWConv2D (Depthwise Separable Convolution) 53, GSConv2D 54, DSConv2D (Distributed Offset Convolution) 55, PConv2D (Partial Convolution) 44, DCNV2 (Deformable Convolution), and DCNV3 (Deformable Convolution) 56. Table 5 shows the comparison results of the various convolutions. We can observe that the best overall result is PConv2D with a mean time of 2.25 ms, followed by DWConv2D, which also works very well with a mean time of 3.35 ms, and the worst result is the ordinary convolution Conv2D. PConv2D is not only fast but also has the least mean time. It is worth mentioning that in most cases, PConv is not used on its

To objectively assess the overall performance of the improved algorithm, we compared our proposed method with current lightweight detection algorithms under the same experimental conditions using the metrics of Precision, Recall, mAP, model size, parameters (Params), FLOPs, and FPS. The experimental results are summarized in Table 6.
Under mAP@0.5, our proposed method shows promising results compared with YOLOv3-tiny, YOLOv5n, YOLOv5s, and YOLOv7-tiny. Under mAP@0.5:0.95, our proposed method also outperforms most lightweight detection algorithms. The YOLOv5n model is close to our proposed method regarding parameters, FLOPs, and model size but performs poorly regarding mAP and precision. The YOLOv5s model outperforms the YOLOv5n model in some respects but has a lower frame rate and relatively high model complexity. The YOLOv6n model is likewise limited by its large parameter count and FLOPs. YOLOv7-tiny shows high detection speed, but its mAP and precision do not match.
After optimization by our proposed method, the improved model's parameters and FLOPs are significantly reduced, 42.9% and 51.8% less than the original YOLOv8n model, respectively. In addition, our improved model has the smallest size while improving in terms of mAP@0.5, accuracy, and FPS. Although our improved model is slightly weaker in mAP@0.5:0.95, it has obvious advantages in all other aspects, so the loss in this part is acceptable.
Comparative results show that our proposed method achieves a better balance between accuracy and lightweight design in the Color-changing melon detection task. This approach lays the foundation for subsequent research.

Discussion
Deep learning-based target detection algorithms are practical for high-precision fruit picking and disease detection tasks and are widely used in smart agriculture. Fast decision-making is required when working on resource-constrained devices, such as small cameras on robotic arms and micro drones. However, the complexity of the model's network structure makes actual deployment difficult. Therefore, there is a great need to balance the real-time performance and lightweight nature of models in agricultural inspection tasks. This study uses a lightweight YOLOv8-CML model designed for Color-changing melon maturity detection. The YOLOv8-CML model performs well compared to existing lightweight models, but this study still has limitations.
The model we design must fit into a natural Color-changing melon-picking environment to make this study practical. The images we collected were taken with a cell phone camera at a distance of 80-120 cm. In contrast, the small camera of a robotic arm may be closer to the fruit, depending on where the camera is mounted on the arm. The purpose of using a cell phone is to capture images under as many different angles and conditions as possible, such as occlusion and backlighting, which may occur in real-world environments, thus enriching the Color-changing melon dataset and improving the robustness of the model. However, the difference in shooting distance may affect the model's performance. To address this problem, we need to study the model further and do more testing and optimization in future research to improve its robustness and better handle Color-changing melon image data from different distances and angles.

Conclusion
In this study, we propose a lightweight detection model, YOLOv8-CML, to help automatic picking robots identify Color-changing melon fruits and to ease deployment on edge devices by addressing the slow detection of Color-changing melon fruits and the high deployment cost in smart agricultural devices. Compared to the YOLOv8n model, the parameter count and FLOPs of the YOLOv8-CML model decreased by 42.9% and 51.8%, respectively. In addition, the model size is only 3.7 MB, and the inference speed is improved by 6.9%, while mAP@0.5 and accuracy are improved by 1.4% and 2.4%, respectively. In a comprehensive comparison, our proposed method achieves a better balance between model complexity and detection accuracy, which provides some ideas for deploying Color-changing melon detection models on edge devices. In future work, we will further explore simplifying the model while maintaining its robustness, for example by compressing it with knowledge distillation and improving its running speed, to make it more suitable for deployment on mobile or embedded devices and to provide technical support for picking robots.

Table 6. Comparative results of different lightweight algorithms.

Figure 7. Comparison of the structure of the detection head before and after improvement.

Figure 8. Comparison of recognition results between the original model and the final improved model.

Figure 9. Comparison of confusion matrices before and after improvement.

Figure 10. Heatmap visualizations for the original model and the final improved model.
test set. Next, the images are labeled using the LabelImg image annotation tool. For each image, the labels are stored in both PASCAL VOC format and YOLO format to suit different algorithms.
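The relationship between the two label formats can be sketched as follows. PASCAL VOC stores each box as pixel corner coordinates in XML, whereas YOLO stores a class index followed by the box centre, width, and height normalized by the image size. The class order in `CLASSES`, the function name, and the sample annotation are assumptions made for illustration; they are not the files or indices used in this study.

```python
import xml.etree.ElementTree as ET

# Assumed class order: three categories are reported, but the index
# assigned to each one is a guess made for this example.
CLASSES = ["immature", "semi-mature", "mature"]


def voc_to_yolo(xml_text, classes=CLASSES):
    """Convert one PASCAL VOC annotation (XML string) to YOLO label lines."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    img_w = float(size.find("width").text)
    img_h = float(size.find("height").text)

    lines = []
    for obj in root.iter("object"):
        cls_id = classes.index(obj.find("name").text)
        box = obj.find("bndbox")
        xmin = float(box.find("xmin").text)
        ymin = float(box.find("ymin").text)
        xmax = float(box.find("xmax").text)
        ymax = float(box.find("ymax").text)
        # YOLO uses the normalized box centre and size instead of corners.
        x_c = (xmin + xmax) / 2.0 / img_w
        y_c = (ymin + ymax) / 2.0 / img_h
        w = (xmax - xmin) / img_w
        h = (ymax - ymin) / img_h
        lines.append(f"{cls_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}")
    return lines


# Hypothetical annotation with a single centred box.
SAMPLE_XML = """<annotation>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>mature</name>
    <bndbox><xmin>160</xmin><ymin>120</ymin><xmax>480</xmax><ymax>360</ymax></bndbox>
  </object>
</annotation>"""

print(voc_to_yolo(SAMPLE_XML))
```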

Table 2. Number of labels per set of three categories.
Columns: Category, Immature, Mature, Semi-mature, Total.

released in January 2023 by Ultralytics. For now, YOLOv8 is the newest version of the YOLO family, which improves its comprehensive performance compared to previous versions while providing a unified framework for model training. In addition, it is suitable for a wide range of tasks, including target detection, target tracking, instance segmentation, and image classification. The model structure of YOLOv8 is divided into three main parts: (1) Backbone for extracting target features; (2) Neck for fusing multi-scale feature information;

Scientific Reports | (2024) 14:14400 | https://doi.org/10.1038/s41598-024-65293-w

Table 3. The hyperparameters of the model training phase are as follows: the input image size is 640 × 640, the batch size is 64, the optimizer is stochastic gradient descent (SGD), and training runs for 300 epochs. The learning rate is initialized to 0.01, with momentum and weight decay set to 0.937 and 0.0005, respectively. All other training parameters use the default values of the YOLOv8n model. In addition, the official pre-trained weight files were used in this study to enhance the generalization of the trained model.
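For readers who wish to reproduce a comparable baseline run, the hyperparameters above could be passed to the Ultralytics training interface roughly as sketched below. This is an assumption-laden sketch rather than our training script: the dataset file `melon.yaml` is a hypothetical name, and `yolov8n.pt` loads the unmodified YOLOv8n baseline, not the YOLOv8-CML architecture, which would additionally require its own model definition.

```python
# Training hyperparameters as stated in the text, using Ultralytics
# keyword names (imgsz, batch, optimizer, epochs, lr0, momentum, weight_decay).
TRAIN_ARGS = {
    "imgsz": 640,            # input image size 640 x 640
    "batch": 64,             # batch size
    "optimizer": "SGD",      # stochastic gradient descent
    "epochs": 300,           # number of training epochs
    "lr0": 0.01,             # initial learning rate
    "momentum": 0.937,
    "weight_decay": 0.0005,
}


def train_baseline(data_yaml="melon.yaml", weights="yolov8n.pt"):
    """Train the YOLOv8n baseline; `data_yaml` is a hypothetical dataset file."""
    # Imported lazily so the configuration can be inspected without
    # the ultralytics package installed.
    from ultralytics import YOLO

    model = YOLO(weights)  # official pre-trained weights
    return model.train(data=data_yaml, **TRAIN_ARGS)


if __name__ == "__main__":
    for name, value in TRAIN_ARGS.items():
        print(f"{name} = {value}")
```

All parameters not listed in `TRAIN_ARGS` are left at the library defaults, matching the statement that the remaining settings use the default values.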

Table 4. Ablation experiment results. A indicates using the C2f-PConv module, B indicates using the EMAfaster module, C indicates using Efficient Detection, and D indicates using α-CIoU.

Table 5. Comparison results for different Convs.

It is combined with other Convs and embedded in the model as a Block, e.g., Faster Block. Given the excellent performance of PConv, we chose it as the core component of the improved C2f module. For the same reason, we tried combinations of different kinds of convolutions in the detection head and, after comparing them in several experiments, finally chose the combination of PConv and Conv.
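The saving that PConv offers can be illustrated with a short calculation. The sketch below follows the cost model commonly given for partial convolution: a k × k convolution is applied to only c_p of the c channels, and the remaining channels are passed through unchanged. The partial ratio of 1/4 and the feature-map size are assumptions chosen for illustration and are not values reported in this study.

```python
def conv_macs(h, w, k, c_in, c_out):
    """Multiply-accumulate count of a regular k x k convolution."""
    return h * w * k * k * c_in * c_out


def pconv_macs(h, w, k, c, ratio=0.25):
    """Multiply-accumulate count of a partial convolution.

    Only c_p = c * ratio channels are convolved; the rest are left untouched.
    The default ratio of 1/4 is an assumption for this example.
    """
    c_p = int(c * ratio)
    return conv_macs(h, w, k, c_p, c_p)


def conv_mem_access(h, w, k, c):
    """Approximate memory access of a regular convolution (features + weights)."""
    return h * w * 2 * c + k * k * c * c


def pconv_mem_access(h, w, k, c, ratio=0.25):
    """Approximate memory access of a partial convolution."""
    c_p = int(c * ratio)
    return h * w * 2 * c_p + k * k * c_p * c_p


# Illustrative layer: 80 x 80 feature map, 3 x 3 kernel, 64 channels.
h, w, k, c = 80, 80, 3, 64
print(conv_macs(h, w, k, c, c), pconv_macs(h, w, k, c))
print(conv_mem_access(h, w, k, c), pconv_mem_access(h, w, k, c))
```

With a ratio of 1/4, the computation falls to 1/16 of that of the regular convolution, and the feature-map memory traffic falls to about 1/4, which is consistent with the motivation for building the lighter C2f module around PConv.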